Abstract

Support vector machine (SVM) training has been an active research area
since the dawn of the method. In recent years there has been increasing
interest in specialized solvers for the important case of linear models. The
algorithm presented by Hsieh et al., probably best known under the name
of the "liblinear" implementation, marks a major breakthrough. The
method is analogous to established dual decomposition algorithms for
training of non-linear SVMs, but with greatly reduced computational
complexity per update step. This comes at the cost of no longer keeping
track of the gradient of the objective, which excludes the application
of highly developed working set selection algorithms. We present an
algorithmic improvement to this method. We replace uniform working set
selection with an online adaptation of selection frequencies. The
adaptation criterion is inspired by modern second order working set selection
methods. The same mechanism replaces the shrinking heuristic. This
novel technique speeds up training in some cases by more than an order
of magnitude.
1 Introduction



Since the pioneering work of Joachims [8] and Platt [12], support vector
machine (SVM) training has been dominated by decomposition algorithms solving
the dual problem. This approach is at the core of the extremely popular software
libsvm, although a number of important algorithmic improvements have been
added over the years [1].

For linear SVM training, the situation is different. It has been observed
that a direct representation of the primal weight vector is computationally
advantageous over optimization of dual variables, since the dimensionality of the
optimization problem is independent of the number of training patterns. At
first glance this hints at solving the primal problem directly, despite the
non-differentiable nature of the hinge loss. Influential examples of this research
direction are the cutting plane approach [9] and the stochastic gradient descent
"Pegasos" algorithm [14].

Hsieh et al. [7] were the first to notice that it is possible to solve the dual
problem with a decomposition algorithm similar to those used for non-linear
SVM training while profiting from the fixed dimensionality of the weight vector.
This method combines the fast (linear) convergence of its kernelized counterpart
with a direct representation of the weight vector, which allows update
steps to be performed in time independent of the size of the training set. The
method has been demonstrated to outperform several algorithms for direct
optimization of the primal problem [7]. We refer the reader to the excellent
review [16] and references therein for a detailed discussion of the differences,
as well as for the relation of the method to non-linear (kernelized) SVM training.

Our study builds upon this work. We present an algorithmic improvement
of the dual method [7]. This new method differs from the existing algorithm in
two aspects, namely the selection of the currently active sub-problem and the
shrinking heuristic. The selection of the working set, defining the sub-problem
in the decomposition algorithm, has been subject to extensive research [10] [4] [5].
However, these elaborate methods are not affordable in the algorithm of [7], and
they are replaced with systematic sweeps over all variables. We propose a more
elaborate method that takes recent experience into account and adapts selection
frequencies of individual variables accordingly. As a side effect, this algorithm
replaces the existing shrinking heuristic [8] [7]. Our experimental evaluation
shows that this new method can achieve considerable speed-ups of more than
an order of magnitude.

The remainder of this paper is organized as follows. In the next section we
review the dual training algorithm by [7] and introduce our notation. Then we
present our modifications in section 3 and an extensive experimental evaluation
thereof in section 4. We discuss the results (section 5) and close with our
conclusions (section 6).
2 Linear SVM Training in the Dual



In this section we describe the dual algorithm for linear SVM training [7], as
implemented in the software liblinear [3]. It should be noted that the liblinear
software supports a large number of training methods for linear models, such
as binary and multi-category classification and regression, as well as different
regularizers and loss functions. In this study we restrict ourselves to the most
basic case, which is the "standard" SVM with hinge loss and two-norm
regularizer. However, our approach is general in nature and therefore applicable to
many of the above cases.

Given a binary classification problem described by training data

$$\big((x_1, y_1), \dots, (x_\ell, y_\ell)\big) \in \big(\mathbb{R}^d \times \{-1, +1\}\big)^\ell ,$$

SVM training corresponds to finding the solution of the (primal) optimization
problem

$$\min_{w \in \mathbb{R}^d} \quad \frac{1}{2} \|w\|^2 + C \cdot \sum_{i=1}^{\ell} L\big(y_i, \langle w, x_i \rangle\big) ,$$

where $L(y, f(x)) = \max\{0, 1 - y \cdot f(x)\}$ is the hinge loss and $C > 0$ controls
the solution complexity [2] [13]. The prediction of the machine is of the form
$h : \mathbb{R}^d \to \{-1, +1\}$, $h(x) = \operatorname{sign}(f(x))$, based on the linear decision function
$f(x) = \langle w, x \rangle$. In this formulation we have dropped the constant offset $b$ that is
often added to the decision function, since it turns out to be of minor importance
in high dimensional feature spaces, and dropping the term results in attractive
algorithmic simplifications (see, e.g., [15]).

The corresponding dual optimization problem is the box constrained quadratic
program

$$\max_{\alpha \in \mathbb{R}^\ell} \quad W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$
$$\text{s.t.} \quad 0 \le \alpha_i \le C \quad \forall i \in \{1, \dots, \ell\} .$$

It holds $w = \sum_{i=1}^{\ell} y_i \alpha_i x_i$. Since the dual training method is rooted in non-linear
SVM training, we want to mention that in general all inner products $\langle x_i, x_j \rangle$
between training examples are replaced with a usually non-linear Mercer kernel
function $k(x_i, x_j)$.

A standard method for support vector machine training is to solve the dual
problem with a decomposition algorithm [11] [5] [1]. The algorithm decomposes
the full quadratic program into a sequence of sub-problems restricted to few
variables. The sub-problems are solved iteratively until an overall solution of
sufficient accuracy is found. Sequential minimal optimization (SMO, [12]) refers
to the important special case of choosing the number of variables in each
sub-problem minimal. For the above dual this minimal number is one, so that the
algorithm essentially performs coordinate ascent. The skeleton of this method
is shown in algorithm 1.

Algorithm 1 SMO without equality constraint
repeat
    select active variable $i$
    solve sub-problem restricted to variable $i$
    update $\alpha_i$ and further state variables
until (all KKT violations $< \varepsilon$)
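The relation between the primal and the dual problem can be checked numerically. The following is a minimal sketch, assuming dense NumPy arrays; the function names are ours and purely illustrative, and nothing here is taken from the liblinear code base. It evaluates the primal objective with hinge loss, the dual objective $W(\alpha)$, and the weight vector $w = \sum_i y_i \alpha_i x_i$.

```python
import numpy as np

def primal_objective(w, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses (no offset term b)."""
    margins = y * (X @ w)
    return 0.5 * (w @ w) + C * np.maximum(0.0, 1.0 - margins).sum()

def weights_from_dual(alpha, X, y):
    """w = sum_i y_i * alpha_i * x_i for rows x_i of X."""
    return X.T @ (alpha * y)

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 0.5 * ||sum_i y_i alpha_i x_i||^2."""
    w = weights_from_dual(alpha, X, y)
    return alpha.sum() - 0.5 * (w @ w)
```

For any feasible $\alpha$ the dual value is a lower bound on the primal value at any $w$, and both coincide at the optimum, which makes the pair a convenient correctness check for a solver.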

The training algorithm for linear SVMs by [7] is an adaptation of this
technique. Its crucial algorithmic improvement over standard SMO is to reduce the
complexity of each iteration from $O(\ell \cdot d)$ to only $O(d)$ (see footnote 1). The key
trick is to keep track of the primal vector $w$ during the optimization. This makes
it possible to rewrite the derivative of the dual objective function as

$$\frac{\partial W(\alpha)}{\partial \alpha_i} = 1 - y_i \sum_{j=1}^{\ell} \alpha_j y_j \langle x_i, x_j \rangle = 1 - y_i \langle x_i, w \rangle ,$$

which can be computed in $O(d)$ operations. The requirement to perform all
steps inside the SMO loop in $O(d)$ operations makes some changes necessary,
as compared to standard SMO. For instance, the flat SMO loop is split into an
outer and an inner loop. The full algorithm is provided in detail as algorithm 2.

Most prominently, the selection of the active variable $i \in \{1, \dots, \ell\}$ (defining
the sub-problem to be solved in the current iteration) cannot be done with
elaborate heuristics that are key to fast training of non-linear SVMs [10] [4] [5].
Instead, the algorithm performs systematic sweeps over all variables. The order
of variables is randomized. The (amortized) complexity of selecting the active
variable is $O(1)$.

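The solution of the sub-problem amounts to

$$\alpha_i^{new} = \left[ \alpha_i^{old} + \frac{1}{\|x_i\|^2} \frac{\partial W}{\partial \alpha_i} \right]_0^C = \left[ \alpha_i^{old} + \frac{1 - y_i \langle x_i, w \rangle}{\|x_i\|^2} \right]_0^C ,$$

where $[x]_a^b = \max\{a, \min\{b, x\}\}$ denotes clipping to the interval $[a, b]$. This
operation, requiring two inner products, is done in $O(d)$ operations (in the
implementation the squared norm is precomputed).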

The usual SMO procedure of keeping track of the dual gradient $\nabla_\alpha W(\alpha)$ is not
possible within the tight budget of $O(d)$ operations. Instead the weight vector
is updated. Let $\mu = \alpha_i^{new} - \alpha_i^{old}$ denote the step performed for the solution of
the sub-problem, then the weight update reads $w \leftarrow w + \mu \cdot y_i \cdot x_i$, which is an
$O(d)$ operation.
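A single update step, consisting of the gradient computation, the clipped Newton step and the weight update, can be sketched as follows. This is an illustration under our own assumptions (dense NumPy rows, precomputed non-zero squared norms, the hypothetical function name `coordinate_step`), not the liblinear implementation.

```python
import numpy as np

def coordinate_step(i, alpha, w, X, y, C, sq_norms):
    """One dual coordinate ascent step on variable i; returns (g_i, mu).

    alpha and w are modified in place. sq_norms[i] = ||x_i||^2 is assumed
    to be precomputed and strictly positive.
    """
    g = 1.0 - y[i] * (X[i] @ w)                  # derivative dW/dalpha_i
    new_alpha = min(C, max(0.0, alpha[i] + g / sq_norms[i]))  # clip to [0, C]
    mu = new_alpha - alpha[i]                    # step actually performed
    alpha[i] = new_alpha
    w += mu * y[i] * X[i]                        # keep w = sum_i y_i alpha_i x_i
    return g, mu
```

Both the inner product and the weight update touch each component of $x_i$ once, which is where the $O(d)$ cost per step comes from.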



1 The algorithm is particularly efficient for sparse inputs. Then the complexity of $O(d)$ is
further reduced to $O(\mathrm{nnz})$, where nnz is the number of non-zero components of the currently
selected training example $x_i$.
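The sparse case mentioned in the footnote can be illustrated with a minimal sketch. We assume, purely for illustration, that a training example is stored as a pair of arrays holding the indices and values of its non-zero components, with no index repeated; the helper names are hypothetical.

```python
import numpy as np

def sparse_dot(idx, val, w):
    """<x, w> for a sparse x given by non-zero indices idx and values val."""
    return float(val @ w[idx])

def sparse_axpy(scale, idx, val, w):
    """In-place update w <- w + scale * x, touching only the non-zeros of x.

    Assumes that idx contains no duplicate indices.
    """
    w[idx] += scale * val
```

Both operations cost time proportional to the number of stored entries of the example rather than to the full dimension $d$.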
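Algorithm 2 "liblinear" algorithm
$A \leftarrow \{1, \dots, \ell\}$; $v^{old}_{min} \leftarrow -\infty$; $v^{old}_{max} \leftarrow \infty$
loop
    $v_{min} \leftarrow \infty$; $v_{max} \leftarrow -\infty$
    for all $i \in A$ in random order do
        $g_i \leftarrow 1 - y_i \cdot \langle x_i, w \rangle$
        if $\alpha_i = 0$ and $g_i < v^{old}_{min}$ then
            $A \leftarrow A \setminus \{i\}$
        else if $\alpha_i = C$ and $g_i > v^{old}_{max}$ then
            $A \leftarrow A \setminus \{i\}$
        else
            if ($\alpha_i > 0$ and $g_i < v_{min}$) then $v_{min} \leftarrow g_i$
            if ($\alpha_i < C$ and $g_i > v_{max}$) then $v_{max} \leftarrow g_i$
            $\mu \leftarrow [\alpha_i + g_i / \|x_i\|^2]_0^C - \alpha_i$
            $\alpha_i \leftarrow \alpha_i + \mu$
            $w \leftarrow w + \mu \cdot y_i \cdot x_i$
        end if
    end for
    if ($v_{max} - v_{min} < \varepsilon$) then
        if ($A = \{1, \dots, \ell\}$) then break
        $A \leftarrow \{1, \dots, \ell\}$; $v^{old}_{min} \leftarrow -\infty$; $v^{old}_{max} \leftarrow \infty$
    else
        if $v_{min} < 0$ then $v^{old}_{min} \leftarrow v_{min}$ else $v^{old}_{min} \leftarrow -\infty$
        if $v_{max} > 0$ then $v^{old}_{max} \leftarrow v_{max}$ else $v^{old}_{max} \leftarrow \infty$
    end if
end loop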
The usual stopping criterion is to check the maximal violation of the
Karush-Kuhn-Tucker (KKT) optimality conditions. For the dual problem the KKT
violation can be expressed independently for each variable. Let $g_i = \partial W / \partial \alpha_i$ denote
the derivative of the dual objective. Then the violation is $|g_i|$ if $0 < \alpha_i < C$,
$\max\{0, g_i\}$ for $\alpha_i = 0$, and $\max\{0, -g_i\}$ for $\alpha_i = C$. The algorithm is stopped
as soon as this violation drops below some threshold $\varepsilon$ (e.g., $\varepsilon = 0.001$).
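The three cases of the KKT violation translate directly into a small function. The sketch below uses a hypothetical name and plain Python floats; it only restates the case distinction given above.

```python
def kkt_violation(g, alpha, C):
    """KKT violation of one dual variable with derivative g = dW/dalpha_i."""
    if alpha <= 0.0:
        return max(0.0, g)    # at the lower bound only an increase can help
    if alpha >= C:
        return max(0.0, -g)   # at the upper bound only a decrease can help
    return abs(g)             # free variable: any non-zero derivative violates
```

A variable at the lower bound with negative derivative, or at the upper bound with positive derivative, is pressed against its bound and does not violate optimality.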

In the original SMO algorithm this check is a cheap by-product of the
selection of the index $i$. Since keeping track of the dual gradient is impossible,
the exact check becomes an $O(\ell \cdot d)$ operation. This is the complexity of a
whole sweep over the data. In algorithm 2 the exact check is therefore replaced
with an approximate check, where each variable is checked at the time it is
active during the sweep. Thus, all variables are checked, but not exactly at the
time of stopping. The algorithm keeps track of $v_{min} = \min\{g_i \mid \alpha_i > 0\}$ and
$v_{max} = \max\{g_i \mid \alpha_i < C\}$, and checks for $v_{max} - v_{min} < \varepsilon$ at the end of the sweep.

To exploit the sparsity of the SVM solution, the algorithm is equipped with
a shrinking heuristic. This heuristic removes a variable from the set $A$ of active
variables if it is at the bounds and the gradient of the dual objective
function indicates that it will stay there. After a while, this heuristic can remove
most variables from the problem, making sweeps over the variables much faster.
The drawback of this heuristic is that it can fail. Therefore, at the end of the
optimization run, the algorithm needs to check optimality of the deactivated
variables. The detection of a mistake results in continuation of the loop, which
can be costly. Therefore, the decision to remove a variable needs to be
conservative. The algorithm removes a variable only if it is at a bound and its gradient
$g_i$ pushes against the bound with a strength that exceeds the maximal KKT
violation of active variables during the previous sweep. This amounts to the
condition $g_i < v^{old}_{min}$ for $\alpha_i = 0$ and to $g_i > v^{old}_{max}$ for $\alpha_i = C$.
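The hard shrinking rule can be written as a simple predicate. The following sketch uses hypothetical names; the two thresholds are the values of $v_{min}$ and $v_{max}$ recorded during the previous sweep.

```python
def should_shrink(g, alpha, C, v_min_old, v_max_old):
    """Decide whether a variable is removed from the active set.

    g is the current derivative dW/dalpha_i. A variable is removed only if
    it sits at a bound and its derivative pushes against that bound more
    strongly than the extreme derivatives seen in the previous sweep.
    """
    if alpha == 0.0 and g < v_min_old:
        return True
    if alpha == C and g > v_max_old:
        return True
    return False
```

With the initial thresholds of minus and plus infinity no variable can be removed, so the first sweep always visits all variables.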

3 Online Adaptation of Variable Selection Frequencies

In this section we introduce our algorithmic improvement to the above described 
linear SVM training algorithm. Our modification targets two weaknesses of the 
algorithm at once. 

• Algorithm 2 executes uniform sweeps over all active variables. In contrast
to the SMO algorithm for non-linear SVM training, the selection is not
based on a promise of the progress due to this choice. Although the
computational restriction of $O(d)$ operations does not allow for a search
for the best a-priori guarantee of some sort (such as the largest KKT
violation), we can still learn from the observed progress after a step has
been executed.

• Shrinking of variables is inevitably a heuristic. Algorithm 2 makes "hard"
shrinking decisions by removing variables based on adaptive thresholds on
the strength with which they press against their active constraints. It is
problematic that even a single wrong decision to remove a variable early
on can invalidate a large share of the algorithm's (fine tuning) efforts later
on. Therefore we replace this mechanism with what we think of as "soft"
shrinking, which amounts to the reduction of the selection frequency of a
variable, down to a predefined minimum.

In algorithm 2 there are only two possible frequencies with which variables
are selected. All inactive variables are selected with frequency zero, and all
active variables are selected with the same frequency $1/|A|$. This scheme is
most probably not optimal; it is instead expected that some variables should be
selected much more frequently than others.

Established working set selection heuristics aim to pick the best (in some 
sense) variable for the very next step, and therefore automatically adapt relative 
frequencies of variable selection over time. This is not possible within the given 
framework. However, we can still use the information of whether a step has 
made good progress or not to adapt selection frequencies for the future. This 
adaptation process is similar to so-called self-adaptation heuristics found in 
modern direct search methods, see e.g. [6]. To summarize, although we are 
unable to determine the best variable for the present step, we can still use 
current progress as an indicator for future utility. 

For turning this insight into an algorithm we introduce adaptive variable
selection frequencies based on preference values $p_i > 0$. The relative frequency
of variable $\alpha_i$ is defined as

$$\frac{p_i}{\sum_{j=1}^{\ell} p_j} .$$

In each iteration of the outer loop the algorithm composes a schedule (a list of $\ell$
variable indices) that reflects these relative frequencies. This task is performed
by algorithm 4. With $O(\ell)$ operations it is about as cheap as the randomization
of the order of variables. We call this novel variable selection scheme adaptive
variable selection frequencies (AVSF).

The crucial question is how to update the preferences $p_i$ over the course of
the optimization run. For this purpose the gain $\Delta = W(\alpha^{new}) - W(\alpha^{old})$ of
an iteration with active variable $\alpha_i$ is compared to the average (reference) gain
$\Delta_{ref}$. Since the average gain decreases over time, this value is estimated as a
fading average. The preference is changed by the rule

$$p_i \leftarrow \Big[ p_i \cdot \exp\big( c \cdot (\Delta / \Delta_{ref} - 1) \big) \Big]_{p_{min}}^{p_{max}} .$$

In our experiments we set the constants to $c = 1/5$, $p_{min} = 1/20$, and $p_{max} = 20$.
The bounds $0 < p_{min} < p_{max} < \infty$ ensure that the linear convergence guarantee
established by Theorem 1 in [7] directly carries over to our modified version.
The adaptation of preference values is taken care of by algorithm 5. The added
complexity per iteration is only $O(1)$.
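To get a feeling for these constants, consider a short worked example of ours (it is not part of the experiments), based on the update of algorithm 5, which uses the ratio $\Delta / \Delta_{ref}$. A step with zero gain multiplies the preference by a fixed factor, independent of the value of $\Delta_{ref} > 0$, while a step with twice the reference gain increases it:

$$\Delta = 0: \quad p_i \leftarrow p_i \cdot e^{c \cdot (0 - 1)} = p_i \cdot e^{-1/5} \approx 0.819 \cdot p_i , \qquad \Delta = 2 \Delta_{ref}: \quad p_i \leftarrow p_i \cdot e^{1/5} \approx 1.221 \cdot p_i .$$

Starting from $p_i = 1$, the number $k$ of consecutive zero-gain steps needed to reach the lower bound $p_{min} = 1/20$ follows from

$$e^{-k/5} \le \frac{1}{20} \quad \Longleftrightarrow \quad k \ge 5 \ln 20 \approx 14.98 ,$$

so the preference of a variable that never moves is clipped to $p_{min}$ after 15 visits.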
The dual objective gain $\Delta$ is used in modern second order working set
selection algorithms for non-linear SVM training [4] [5]. Our method resembles this
highly efficient approach; it can be understood as a time averaged variant.

It is important to note that the above scheme does not only increase the
preferences and therefore the relative frequencies of variables that have performed
above average in the past. It also penalizes variables that do not move at all,
typically because they are at the bounds and should be removed from the active
set: such steps give the worst possible gain of zero. Thus, the algorithm quickly
drives their preferences to the minimum. However, they are not removed
completely from the active set. Checking these variables from time to time is
worthwhile: it is cheap compared to uniform sweeps, and it prevents early
mistakes from being discovered only much later.

Another difference to the original algorithm is that shrinking decisions are 
not based on KKT violations, but instead on relative progress in terms of the 
dual objective function. We are not aware of an existing approach of this type. 

Algorithm 3 incorporates our modifications into the liblinear algorithm. The
new algorithm is no more complex than algorithm 2, and it requires only a
handful of changes to the existing liblinear implementation.



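Algorithm 3 Linear SVM algorithm with adaptive variable selection frequencies (AVSF)
$p \leftarrow (1, \dots, 1) \in \mathbb{R}^\ell$; $p_{sum} \leftarrow \ell$
$\Delta_{ref} \leftarrow 0$
canstop $\leftarrow$ true
loop
    $v \leftarrow 0$
    define schedule $I \in \{1, \dots, \ell\}^\ell$ (algorithm 4)
    for all $i \in I$ in random order do
        $g_i \leftarrow 1 - y_i \cdot \langle x_i, w \rangle$
        if ($\alpha_i > 0$ and $-g_i > v$) then $v \leftarrow -g_i$
        if ($\alpha_i < C$ and $g_i > v$) then $v \leftarrow g_i$
        $\mu \leftarrow [\alpha_i + g_i / \|x_i\|^2]_0^C - \alpha_i$
        $\alpha_i \leftarrow \alpha_i + \mu$
        $w \leftarrow w + \mu \cdot y_i \cdot x_i$
        update preferences (algorithm 5)
    end for
    if $v < \varepsilon$ then
        if canstop then break
        $p \leftarrow (1, \dots, 1) \in \mathbb{R}^\ell$; $p_{sum} \leftarrow \ell$
        canstop $\leftarrow$ true
    else
        canstop $\leftarrow$ false
    end if
end loop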
Algorithm 4 Definition of the schedule $I$
$N \leftarrow p_{sum}$
$j \leftarrow 0$
for $i \in \{1, \dots, \ell\}$ do
    $m \leftarrow p_i \cdot (\ell - j) / N$
    $n \leftarrow \lfloor m \rfloor$
    with probability $m - n$: $n \leftarrow n + 1$
    for $k \in \{1, \dots, n\}$ do
        $I_j \leftarrow i$
        $j \leftarrow j + 1$
    end for
    $N \leftarrow N - p_i$
end for
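The schedule construction can be sketched in a few lines of Python. This is our own illustrative transcription with zero-based indices and a hypothetical function name; the explicit handling of the last index and the `min` guard are additions of ours that protect against floating-point round-off and are not part of the pseudocode.

```python
import math
import random

def build_schedule(p, rng=random):
    """Return a list of len(p) indices; index i appears about
    len(p) * p[i] / sum(p) times (randomized rounding)."""
    ell = len(p)
    remaining = float(sum(p))              # N in the pseudocode
    schedule = []
    for i, p_i in enumerate(p):
        slots_left = ell - len(schedule)   # ell - j in the pseudocode
        m = p_i * slots_left / remaining if remaining > 0.0 else 0.0
        m = min(m, slots_left)             # guard against round-off overshoot
        n = math.floor(m)
        if rng.random() < m - n:           # round up with probability m - n
            n += 1
        if i == ell - 1:
            n = slots_left                 # exact arithmetic gives m = slots_left
        schedule.extend([i] * n)
        remaining -= p_i
    return schedule
```

With uniform preferences every index appears exactly once, so the scheme reduces to a plain sweep. The list is produced in index order; the training loop visits it in random order, for example after a call to `random.shuffle`.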



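Algorithm 5 Update of the preferences $p$
$\Delta \leftarrow \mu \cdot (g_i - \mu / 2 \cdot \|x_i\|^2)$
if first sweep then
    $\Delta_{ref} \leftarrow \Delta_{ref} + \Delta / \ell$
else
    $h \leftarrow c \cdot (\Delta / \Delta_{ref} - 1)$
    $p_{new} \leftarrow [e^h \cdot p_i]_{p_{min}}^{p_{max}}$
    $p_{sum} \leftarrow p_{sum} + p_{new} - p_i$
    $p_i \leftarrow p_{new}$
    $\Delta_{ref} \leftarrow (1 - 1/\ell) \cdot \Delta_{ref} + \Delta / \ell$
end if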
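The preference update can be sketched as a pure function. The name, the return convention and the two numerical guards (skipping the update while the reference gain is zero, and capping the exponent) are assumptions of ours and do not appear in the pseudocode; the default constants are the values reported for the experiments.

```python
import math

def update_preference(p_i, delta, delta_ref, ell, first_sweep,
                      c=0.2, p_min=0.05, p_max=20.0):
    """Return (new p_i, new delta_ref) after a step with dual gain delta."""
    if first_sweep:
        # During the first sweep only the reference gain is accumulated.
        return p_i, delta_ref + delta / ell
    if delta_ref > 0.0:
        h = c * (delta / delta_ref - 1.0)
        h = min(h, 50.0)   # avoid overflow; the result is clipped to p_max anyway
        p_i = min(p_max, max(p_min, math.exp(h) * p_i))
    # Fading average of the gain.
    return p_i, (1.0 - 1.0 / ell) * delta_ref + delta / ell
```

A caller that maintains the running sum of preferences would add the difference between the returned and the previous value of `p_i` to it, which keeps the cost per step at $O(1)$.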
Problem       Instances ($\ell$)    Features ($d$)
cover type           581,012                54
kdd-a              8,407,752        20,216,830
kdd-b             19,264,097        29,890,095
news 20               19,996         1,355,191
rcv1                  20,242            47,236
url                2,396,130         3,231,961

Table 1: Number of training examples and number of features of the data sets
used in our comparison.



4 Experimental Evaluation 

We compare our adaptive variable selection frequencies algorithm 3 (AVSF) to
the baseline algorithm 2 in an empirical study. For a fair comparison we have
implemented our modifications directly into the latest version of liblinear
(version 1.92 at the time of writing).

The aim of the experiments is to demonstrate the superior training speed of
algorithm 3 over a wide range of problems and experimental settings. Therefore
we have added time measurement and a step counter to both algorithms (see
footnote 2).

The liblinear software comes with a hard-coded limit of 1000 outer loop
iterations. We have removed this "feature" for the sake of comparison. Instead
we stop only based on the heuristic stopping criterion described in section 2,
which is exactly the same for both algorithms. We use the liblinear default of
$\varepsilon = 0.01$ as well as the libsvm default of $\varepsilon = 0.001$ in all experiments.

We ran both algorithms on a number of benchmark problems. In our
comparison we rely on medium to extremely large binary classification problems,
downloadable from the libsvm data website:

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Table 1 lists descriptive statistics of the data sets.

Test accuracies are of no relevance for our comparison, since both algorithms
deliver the same solution. Only the runtime matters. Comparing training times
in a fair way is non-trivial. This is because the selection of a good value of the
regularization parameter C requires several runs with different settings, often
performed in a cross validation manner. Therefore the computational cost of
finding a good value of C can easily exceed that of training the final model, and
even a good range for C is hard to guess without prior knowledge. To circumvent
this pitfall we have decided to report training times for a whole range of values,
namely $C \in \{0.01, 0.1, 1, 10, 100, 1000\}$. We have averaged the timings for the
data sets news 20 and rcv1 over 10 independent runs in order to obtain stable
results.

2 The timer measures the runtime of the core optimization loop. In particular, data loading 
is excluded. 
Our primary performance metric is wall clock time. A related and easier
to measure quantity is the number of update steps. For both algorithms the
complexity of an update step is $O(d)$ computations. Also, the wall clock time
per step is roughly comparable, but we have to note that a step of the AVSF
algorithm is slightly more costly than a step of the original algorithm.

Training times and numbers of iterations are reported in tables 2 and 3. For
an easier comparison, they are illustrated graphically in figure 1.

5 Discussion 

The typical behavior of the performance timing curves in figure 1 is that the
original algorithm is superior for small values of C, and that our new algorithm
is a lot faster for large values of C: often by an order of magnitude, sometimes
even more. The differences are most pronounced for large data sets. In the
following we will discuss this behavior.

For small values of C many examples tend to become support vectors. Many
dual variables end up at the maximum value. This relatively simple solution
structure can be achieved efficiently with uniform sweeps through the data.
Moreover, the algorithm performs only a few outer loop iterations. In this case
our soft shrinking method is too slow to be effective and the original hard
shrinking heuristic has an advantage. At the same time the problem structure
is "simple enough", so that falsely shrinking variables out of the problem is
improbable.

On the other hand, for large values of C the range of values is much larger and
the values of variables corresponding to points close to or exactly on the target
margin are tedious to adjust to the demanded accuracy. In this case shrinking
is important, and second order working set selection is known to work best.
Also, in this situation shrinking is most at risk of making wrong decisions, so
soft shrinking has an advantage. Only the magnitude of the speed-up is really
surprising.

The forest cover data is an exception to the above scheme. Here the original
algorithm is superior for all tested values of C, although the difference
diminishes for large C. Also, it seems odd that the training times for the different
target accuracies are nearly identical, despite the fact that they can make huge
differences for other data sets. The reason for this effect is most probably the
rather low number of features $d$: once the right weight vector is found, it is
easily tuned to nearly arbitrary precision. Also, since the data is distributed in
a rather low dimensional space there are many functionally similar instances,
which is why adaptation of frequencies of individual variables is less meaningful
than for the other problems.

Training times increase drastically for increasing values of C (note the
logarithmic scale in the plots). Therefore we argue that improved training speed is
most crucial for large values of C. Doubling the training time for small values
of C does not pose a serious problem, since these times are short anyway, while
speeding up training for large values of C by more than an order of magnitude
Problem     Solver    Quantity  C = 0.01    C = 0.1     C = 1       C = 10      C = 100     C = 1000
cover type  baseline  time      1.29        2.73        12.5        69.5        533         4,450
                      steps     3.31·10^6   7.41·10^6   3.38·10^7   1.80·10^8   1.37·10^9   1.14·10^10
            AVSF      time      4.86        7.05        18.1        98.4        724         6,670
                      steps     8.72·10^6   1.28·10^7   3.43·10^7   1.88·10^8   1.50·10^9   1.40·10^10
kdd-a       baseline  time      429         2,340       31,200      138,000     345,000     --
                      steps     3.07·10^8   1.57·10^9   1.88·10^10  8.77·10^10  2.35·10^11  --
            AVSF      time      473         858         2,090       7,880       36,100      --
                      steps     3.62·10^8   6.39·10^8   1.58·10^9   6.15·10^9   5.01·10^10  --
kdd-b       baseline  time      1,150       5,140       53,300      400,000*    932,000*    --
                      steps     6.92·10^8   2.86·10^9   3.11·10^10  --          --          --
            AVSF      time      1,510       3,140       3,820       14,200      166,000     --
                      steps     7.90·10^8   1.52·10^9   2.64·10^9   7.22·10^9   8.32·10^10  --
news 20     baseline  time      0.56        0.60        2.30        3.56        7.39        100
                      steps     8.03·10^4   1.22·10^5   4.04·10^5   6.38·10^5   1.38·10^6   2.47·10^7
            AVSF      time      0.77        1.12        2.13        2.47        5.15        3.95
                      steps     1.20·10^5   2.60·10^5   4.80·10^5   4.40·10^5   8.80·10^5   8.20·10^5
rcv1        baseline  time      0.09        0.13        0.46        1.76        4.27        14.1
                      steps     9.36·10^4   1.46·10^5   4.77·10^5   1.70·10^6   4.19·10^6   1.43·10^7
            AVSF      time      0.18        0.27        0.50        0.95        1.01        1.46
                      steps     1.62·10^5   2.83·10^5   4.86·10^5   9.72·10^5   1.07·10^6   1.54·10^6
url         baseline  time      67.9        353         4,140       22,100      121,000     469,000
                      steps     4.05·10^7   1.93·10^8   2.22·10^9   1.45·10^10  8.04·10^10  2.74·10^11
            AVSF      time      135         213         658         1,720       6,650       31,300
                      steps     6.47·10^7   1.39·10^8   4.17·10^8   1.18·10^9   4.34·10^9   1.73·10^10

Table 2: Runtime in seconds (rows "time") and number of update steps (inner loop
iterations, rows "steps") for both methods, trained for a range of values of C,
with target accuracy $\varepsilon = 0.01$. Algorithm 2 is marked with "baseline", the
adaptive variable selection frequencies algorithm 3 with "AVSF". Runs marked
with "--" did not finish until the deadline. For cases where one algorithm has
finished but the competitor has not we report the running time until present as
a lower bound on the true runtime; these entries are marked with a star. We want
to remark that the actual values may be much bigger.
Problem     Solver    Quantity  C = 0.01     C = 0.1      C = 1        C = 10       C = 100      C = 1000
cover type  baseline  time      1.28         2.75         12.5         69.5         597          4,750
                      steps     3.31·10^6    7.41·10^6    3.38·10^7    1.80·10^8    1.78·10^9    1.44·10^10
            AVSF      time      4.82         7.18         18.7         101          724          6,710
                      steps     8.72·10^6    1.28·10^7    3.43·10^7    1.88·10^8    1.50·10^9    1.40·10^10
kdd-a       baseline  time      817          9,660        239,000      1,800,000*   1,800,000*   --
                      steps     1.11·10^9    9.16·10^9    1.59·10^11   --           --           --
            AVSF      time      [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  --
                      steps     6.22·10^8    9.84·10^8    4.23·10^9    3.80·10^10   2.80·10^11   --
kdd-b       baseline  time      2,610        20,500       459,000      1,450,000*   2,160,000*   --
                      steps     1.94·10^9    1.17·10^10   2.73·10^11   --           --           --
            AVSF      time      2,930        4,660        16,000       89,200       820,000      --
                      steps     1.23·10^9    2.04·10^9    7.57·10^9    4.05·10^10   4.09·10^11   --
news 20     baseline  time      0.56         0.78         8.54         9.84         11.9         103
                      steps     8.03·10^4    1.54·10^5    1.55·10^6    1.87·10^6    2.90·10^6    2.50·10^7
            AVSF      time      0.97         2.13         2.44         4.06         5.15         6.20
                      steps     1.60·10^5    3.80·10^5    5.20·10^5    7.20·10^5    8.80·10^5    1.02·10^6
rcv1        baseline  time      0.09         0.17         2.74         2.85         4.73         18.4
                      steps     9.40·10^4    1.93·10^5    3.36·10^6    3.36·10^6    5.63·10^6    2.14·10^7
            AVSF      time      0.16         0.33         0.87         0.86         1.26         1.75
                      steps     1.82·10^5    3.85·10^5    1.01·10^6    9.92·10^5    1.48·10^6    2.04·10^6
url         baseline  time      139          2,100        22,100       135,000      402,000      703,000
                      steps     8.27·10^7    1.18·10^9    1.46·10^10   7.61·10^10   2.35·10^11   3.78·10^11
            AVSF      time      202          1,030        3,660        20,300       33,300       39,500
                      steps     9.82·10^7    5.65·10^8    2.48·10^9    1.00·10^10   2.37·10^10   2.27·10^10

Table 3: Runtime in seconds (rows "time") and number of update steps (inner loop
iterations, rows "steps") for both methods, trained for a range of values of C,
with target accuracy $\varepsilon = 0.001$. Algorithm 2 is marked with "baseline", the
adaptive variable selection frequencies algorithm 3 with "AVSF". Runs marked
with "--" did not finish until the deadline. For cases where one algorithm has
finished but the competitor has not we report the running time until present as
a lower bound on the true runtime; these entries are marked with a star. We want
to remark that the actual values may be much bigger.
Figure 1: Training times with the original liblinear algorithm (red circles) and
with our adaptive variable selection algorithm (blue squares), as a function of
the regularization parameter C. The target accuracy is $\varepsilon = 0.01$ for the solid
curve and $\varepsilon = 0.001$ for the dashed curve.
can make machine training feasible in the first place. We observe this difference
directly for problems kdd-a and kdd-b.

This argument is most striking when doing model selection for C.
Minimization of the cross validation error is a standard method. The parameter C
is usually varied on a grid on logarithmic scale. This procedure is often a lot
more compute intensive than the final machine training with the best value of
C, and its cost is independent of the resulting choice of C. Its time complexity
is proportional to the row-wise sum of the training times in the tables, i.e., over
all values of C. This cost is clearly dominated by the largest tested value (here
C = 1000), which is where savings due to adaptive variable selection frequencies
are most pronounced.
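The row-wise sum can be made concrete with the url timings reported in table 2. The short calculation below is an illustration of ours: it counts a single training run per value of C and ignores the additional factor introduced by the number of cross validation folds, which is the same for both methods.

```python
# Runtimes in seconds for the url data set at target accuracy 0.01,
# copied from table 2, for C in (0.01, 0.1, 1, 10, 100, 1000).
baseline = [67.9, 353, 4140, 22100, 121000, 469000]
avsf = [135, 213, 658, 1720, 6650, 31300]

total_baseline = sum(baseline)                  # cost of the whole grid, baseline
total_avsf = sum(avsf)                          # cost of the whole grid, AVSF
speedup = total_baseline / total_avsf           # roughly 15
share_largest_c = baseline[-1] / total_baseline # roughly 0.76

print(f"baseline grid cost: {total_baseline:.1f} s")
print(f"AVSF grid cost:     {total_avsf:.1f} s")
print(f"speed-up over the grid: {speedup:.1f}")
print(f"share of C = 1000 in the baseline cost: {share_largest_c:.2f}")
```

For this data set about three quarters of the baseline grid cost stems from the run with C = 1000, so the speed-up over the whole grid is close to the speed-up of that single run.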

6 Conclusion 

We have replaced uniform variable selection in sweeps over the data for
linear SVM training with an adaptive approach. The algorithm extrapolates past
performance into the future and uses this information to adapt variable
selection frequencies. At the same time the reduction of
frequencies of variables at the bounds effectively acts as a soft shrinking technique,
making explicit shrinking heuristics superfluous. To the best of our knowledge
this is the first approach of this type for SVM training.

Our experimental results demonstrate striking success of the new method in 
particular for costly cases. For most problems we achieve speed-ups of up to an 
order of magnitude or even more. This is a substantial performance gain. The 
speed-ups are largest when needed most, i.e., for large training data sets and 
large values of the regularization constant C. 